A workflow is a series of tasks or programs executed in a specific order to achieve a goal. To automate the execution of these tasks, we use a workflow manager. This post introduces and provides a quick startup guide to Snakemake, a widely used workflow management system in bioinformatics.
🐍 Snakemake is a workflow manager that simplifies the creation and execution of workflows. Moreover, it offers robustness and scalability features.
🛠️ Setup
You can install Snakemake via conda (make sure you have conda installed first).
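For example, a typical install from the conda-forge and bioconda channels looks like this (your preferred channel setup may differ):

conda install -c conda-forge -c bioconda snakemake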
📝 Writing your first rule
To illustrate the use of Snakemake, we will write a rule that concatenates two CSV files, one.csv and two.csv, into a single file.
file: one.csv
Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,28,Sydney
David,35,Toronto
Emma,22,Berlin
file: two.csv
Name,Age,City
Frank,27,Paris
Grace,32,Rome
Hannah,29,Tokyo
Ian,40,Madrid
Jack,23,Dublin
We start by writing our first Snakemake rule to concatenate the two CSV files. Each Snakemake rule follows a common syntax, shown below; we will then break the rule down and explain it in detail.
rule <rule_name>:
    input:
        <input_file_1>,
        <input_file_2>
    output:
        <output_file_1>,
        ...
    ...
    run/shell:
        """
        <commands to execute>
        """
Declare your rule name
The first line declares a Snakemake rule with the given name. Every rule name must be unique, i.e., it must not conflict with other rule names in your workflow.
Specify your input files
Next, we specify the input and output files for the rule. Snakemake uses this information to determine dependencies among rules (e.g., to decide which rule should be executed next).
Do not forget the comma after each input file when you have multiple input files.
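For our rule, the input section looks like this:

    input:
        'one.csv',
        'two.csv'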
Specify your output files
We will now specify our output file, i.e., third.csv.
✅ Snakemake executes a rule only when its output files are missing or outdated. In our case, when we run the workflow, Snakemake will automatically decide whether to execute the rule based on the availability of the output file. 📂
🔄 If we execute the workflow a second time, Snakemake will not run the rule again because the output file is already there. 🎯
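For our rule, the output section is simply:

    output:
        'third.csv'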
Specify your log file
It is good practice to have a log file. It comes in very handy for troubleshooting errors when running a workflow with several rules.
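For our rule, the log directive can point to a file such as concatenate.log (the same name that appears in the dry-run output later):

    log:
        'concatenate.log'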
Specify rule logic
This is where we write the commands that achieve the goal of the rule. We can write either Python code or shell commands here: use run for Python code and shell for shell commands. In our case, we need logic that concatenates two CSV files, and we will illustrate both approaches.
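As a side note, the same concatenation could also be done purely in shell; a minimal sketch, assuming both CSV files share an identical header line:

    shell:
        """
        # copy the first file, header included
        cat {input[0]} > {output}
        # append the second file, skipping its header line
        tail -n +2 {input[1]} >> {output}
        """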
Complete Snakefile
First version of Snakefile
Putting the pieces together, the first version of our Snakefile looks like the following.
rule concatenate_csv:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    run:
        import pandas as pd

        # Load csv files
        first = pd.read_csv('one.csv')
        second = pd.read_csv('two.csv')

        # Concatenate files
        third = pd.concat([first, second])

        # Save output file
        third.to_csv('third.csv')
Drawbacks of the Rule
Error Handling: Any error occurring during the execution of the Python code (e.g., File Not Found) is displayed only in the terminal. It would be better to store these errors in dedicated log files for each rule.
Flexibility: If we need to run the same workflow for different input files, we must manually modify multiple parts of the workflow, making it less adaptable. 🔄
Second version of Snakefile
Addressing the Drawbacks
In the second version, we will improve the workflow by making the following changes:
Modularizing the Code – We will move the Python code to a separate script file and execute it using shell. Additionally, we will redirect both standard error and standard output to a log file for better error tracking.
Enhancing Flexibility – Instead of hardcoding file names, we will store them in a configuration file and import them dynamically. This makes the workflow adaptable to different input files with minimal modifications.
Python script file
Preparing a Python script for concatenating two files.
concatenate.py
import sys
import pandas as pd
# Get filenames from command-line arguments
file1 = sys.argv[1]
file2 = sys.argv[2]
output_file = sys.argv[3]
# Load csv files
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
# Concatenate files
df_combined = pd.concat([df1, df2])
df_combined.to_csv(output_file, index=False)
print(f"Successfully merged {file1} and {file2} into {output_file}")
Config file
Writing a configuration file.
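A minimal config.yaml could look like the following sketch; the key names (input_file1, input_file2, output_file) are placeholders of our own choosing and just need to match how they are referenced in the Snakefile:

config.yaml

input_file1: one.csv
input_file2: two.csv
output_file: third.csv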
Workflow updates
We will enhance the workflow with the following changes (a sketch of the resulting Snakefile follows the list):
- Import config.yaml 📄 – We will extract file names dynamically from the configuration file instead of hardcoding them.
- Modify input and output 🔄 – The rule will now use the filenames taken from config.yaml, making it more flexible.
- Add a log component 📜 – We will log all execution details for better debugging and tracking.
- Execute concatenate.py via shell 🖥️ – The script will be executed with the input and output filenames passed as arguments. ➡️ Additionally, both standard output and standard error will be redirected to the log file using &> {log}.
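Here is a sketch of the second version of the Snakefile, assuming the config.yaml keys shown above:

# Snakefile (second version) – file names come from config.yaml
configfile: 'config.yaml'

rule concatenate_csv:
    input:
        config['input_file1'],
        config['input_file2']
    output:
        config['output_file']
    log:
        'concatenate.log'
    shell:
        'python concatenate.py {input[0]} {input[1]} {output} &> {log}'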
🚀 Execution
It is good practice to run a sanity check on your rules. This can be done using the following command, known as a dry run:
snakemake -n
Executing this command produces the following output:
Building DAG of jobs...
Job stats:
job count
--------------- -------
concatenate_csv 1
total 1
[Fri Jan 31 16:33:56 2025]
rule concatenate_csv:
input: one.csv, two.csv
output: third.csv
log: concatenate.log
jobid: 0
reason: Code has changed since last execution
resources: tmpdir=/var/folders/hh/gyd1cnc93nj8sffbhmnpbrfr0000gn/T
Job stats:
job count
--------------- -------
concatenate_csv 1
total 1
Reasons:
(check individual jobs above for details)
code has changed since last execution:
concatenate_csv
Some jobs were triggered by provenance information, see 'reason' section in the rule displays above.
If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via 'snakemake --cleanup-metadata <outfile>'.
Rules with provenance triggered jobs: concatenate_csv
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Now we have our Snakefile ready, and we can execute our workflow.
Running the workflow
To run the workflow, we need to specify the number of cores. We do not need to specify the Snakefile explicitly, because Snakemake automatically searches for a file named Snakefile in the current directory.
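For example, to run the workflow on a single core:

snakemake --cores 1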
On a successful execution, a new file third.csv will be created, containing the records from both one.csv and two.csv.